1 Introduction

The COVID-19 pandemic, also known as the Coronavirus pandemic, is an ongoing global crisis that has significantly altered academia, demanding new regulations and creating unprecedented challenges for both learners and tutors [1]. To minimize transmission of the contagious virus, students have had to study from home. Education systems need to provide online strategies for teaching, learning, and evaluation to help with this transition. Beyond the current demand for online education, some of the new practices imposed by the pandemic can be maintained and used even after the crisis [2]. Investigating and analyzing how the pandemic affects academic activities helps us overcome current challenges. We can use this experience to enhance our academic measures and advance online education capabilities [3].

With the outbreak of Coronavirus (COVID-19) disease, online exams became common practice for academic evaluation. Online exams offer several desirable advantages such as time efficiency [4], ease of use [5], enhanced adaptability [6, 7], and provision of immediate feedback [8]. On the flip side, computer and internet accessibility [9], lack of experience with computer or online assessment processes [10], test anxiety [11], and higher cheating rates [12, 13] are some of the main challenges that come with online exams.

Given the critical pandemic situation, online exams are inevitable, and their use will increase even in non-critical situations. Therefore, to hold them more fairly, suitable methods should be considered, and possible failures should be identified so that they can be mitigated or eliminated as a precaution. The basic questions, or in other words, the objectives of our research, are as follows:

  • What is the definition of a fair exam?

  • Who are the customers of an online test process, and what are their needs?

  • What is the priority and importance of each of these needs for them?

  • What characteristics of the process can be effective in meeting these needs, and to what extent?

The ultimate goal is to provide a list of things that we can do to have a fairer online exam.

Fairness is often regarded as the most important pillar of examinations, which strongly affects students [14, 15]. Exam fairness preserves academic integrity and improves the students' motivation to enhance their performance [16, 17]. There are numerous challenges to fairness in online exams, such as limited proctoring options and higher cheating rates [18].

The current circumstances and the necessity of employing online exams while eliminating their shortcomings call for an effective algorithm. Failure mode and effects analysis (FMEA) can be a robust tool for this purpose. FMEA is a widely used technique to diagnose and prevent product, system, and operation failure modes before they occur [19]. As Lolli et al. mentioned in their work in 2016, FMEA is primarily performed by providing a list of potential failure modes, assigning numbers associated with the severity, detection, and probability of occurrence to each of these events, and eventually obtaining the risk priority number (RPN) by multiplying these numbers. The performance of FMEA relies entirely on proper determination of the severity, detection, and occurrence numbers, and thus the RPN values. For the severity and detection numbers, which are essentially subjective values, this is less of a challenge than for the occurrence number, which has an objective nature [20].

The K-means clustering method is one of the simplest yet most commonly used unsupervised learning algorithms. It can help us prevent conflicting situations, especially in the assignment of occurrence probability numbers [20].

In 2002, Berget and Naes [21] introduced a fuzzy K-means-based clustering algorithm for sorting raw materials to improve the quality of the final product, which works similarly to an optimization problem. In 2004, Sarkar [22] proposed a clustering algorithm for failure modes to investigate the probabilities of each state occurring. Also, in 2014, Lolli et al. [23] presented an application of K-Means for sorting according to multi-criteria classification, the key information of which can be the basis for presenting an algorithm with our intended purpose. In a similar work in 2016, Lolli et al. [20] used K-Means to resolve inconsistencies in the "occurrence" parameter, which is an objective parameter of FMEA.

In the last step of the FMEA method, a list of preventive and corrective actions is presented to mitigate the occurrence, minimize the effects, or enhance the probability of detecting improper conditions [24]. However, time and cost limitations make it impracticable to use all offered solutions to eliminate every unfavorable situation. Therefore, we need to rank and prioritize the recommended corrective actions.

Quality function deployment (QFD) is an effective and robust means commonly used to design engineering products aiming to reach maximum customer satisfaction. In this method, customer needs are associated with the product's technical characteristics in the QFD matrix. Eventually, the QFD process results in a ranked and weighted list of technical product features [25, 26]. In an analogy, failure modes are regarded as customer needs, the RPN as the priority of needs, and the listed preventive and corrective actions as the product features. These entities are supplied to the QFD. The final goal of this algorithm is to present a weighted list of corrective and preventive actions as the output.

In crises such as the COVID-19 pandemic, with the increasing demand for online testing and with time and resource limitations that are more severe than usual, it is essential to have a preventive algorithm for the effective allocation of financial and time resources. The main innovation of the proposed algorithm is to treat the online test process as an engineering product and then use the FMEA, K-Means, and QFD tools together to design it. The most important advantages of such an algorithm are as follows:

The proposed method first identifies all groups that are internal or external customers of this process. Because it is based on a survey of all customers, it is a comprehensive approach. One of the basic foundations of the proposed method is FMEA, which is inherently preventive in nature. Therefore, our algorithm is preventive and deals with preventing faults instead of repairing them after their occurrence. This is very effective in increasing the efficiency of activities in times of crisis. Also, FMEA has contradictions that have been largely resolved in the proposed algorithm using K-Means. Employing QFD, a tool based on maximum customer satisfaction, is very efficient in resource allocation. Therefore, time and financial resources, which are limited especially in times of crisis, will be spent on activities that ultimately lead to greater satisfaction of the process customers.

The proposed algorithm has been applied to mechanical engineering students at the Sharif University of Technology for two consecutive semesters. This paper aims to improve exam fairness by analyzing the worries and challenges that students of the Sharif University of Technology have experienced during their online exams in times of the COVID-19 pandemic. The results are presented and investigated in this paper.

2 Materials and methods

We aim to provide an algorithm that can be deployed to identify existing and potential defects of a fair online exam and then to find and prioritize possible solutions. The prioritization is necessary since time and cost limitations make it impossible to implement all possible solutions. We can therefore apply only the most effective solutions and disregard the less effective ones.

To this end, a fair exam must first be defined; then, according to its characteristics, potential failure modes and their effects should be identified. Solutions to eliminate or reduce the effects should be provided and prioritized.

2.1 Definition of a fair exam

In an online survey, we asked college students and professors to provide their definitions of a fair exam. Additionally, they were requested to list potential problems that they had encountered, describe their effects, and suggest solutions for more fairness. Twelve university professors and 118 students participated in the survey. In order to have a relatively homogeneous statistical population that covers a broad spectrum, the group of professors comprised three people in mathematics and engineering, three in medicine, three in humanities, and three in art. The three people in each discipline were a highly experienced professor (more than 20 years of experience), a moderately experienced professor (between 10 and 20 years of teaching), and a young professor (less than 10 years of experience). Also, from each of the disciplines mentioned in the professors' group, 30 students were selected: 10 students with a GPA of A, 10 students with a GPA of B, and 10 students with a GPA of C. In the art group, the surveys of two students with a GPA of C were invalid, which resulted in a total of 118 students. Summarizing the commonalities and rewriting their views led to the following definition:

A fair assessment occurs when participants' knowledge of the presented topics is measured appropriately, they have equal conditions, and the outcome is fully justifiable to them [15]. Moving toward the above conditions will lead to a fairer exam.

2.2 Basic FMEA

FMEA is a powerful engineering tool for the identification of potential failure modes and their sources. This process is done through thinking about a product, process, or service in reverse [27]. In this method, the effect of each failure mode on the customer is represented by the severity number (S). Likewise, the likelihood of detecting a failure when it occurs is shown by the detection number (D), and the probability of its occurrence is reported by the occurrence number (O). These three numbers lie within the range of 1 to 10. Higher severity and higher probability of occurrence lead to larger S and O numbers, respectively. The D number becomes larger when preventive detection of the failure mode is unlikely. The risk priority number (RPN) is:

$${\text{RPN}} = S \times D \times O$$
(1)

where RPN ranges between 1 and 1000, and higher numbers indicate a higher risk of the failure mode [28,29,30,31]. The scales used to determine the S, O, and D values are provided in Table 1 [32, 33].
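As a minimal sketch of Eq. (1), the following Python snippet computes the RPN and ranks a few failure modes by it. The function name `rpn`, the labels `F_a` to `F_c`, and all numeric values are hypothetical and chosen only for illustration; they are not taken from the study.

```python
def rpn(severity, detection, occurrence):
    """Eq. (1): RPN = S x D x O, with each factor on the 1-10 scale."""
    for name, value in (("S", severity), ("D", detection), ("O", occurrence)):
        if not 1 <= value <= 10:
            raise ValueError(f"{name} must lie in [1, 10], got {value}")
    return severity * detection * occurrence


# Hypothetical failure modes as (S, D, O) triples; illustrative values only.
modes = {"F_a": (8, 3, 2), "F_b": (4, 6, 7), "F_c": (9, 2, 1)}

# Rank failure modes from highest to lowest RPN.
ranked = sorted(modes, key=lambda name: rpn(*modes[name]), reverse=True)
```

In this illustration, `F_c` has the highest severity but the lowest RPN, which is the kind of case that a ranking based on the raw product alone can understate.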

Table 1 FMEA Scale for severity (S), occurrence (O), and detection (D) numbers [32, 33]

2.3 Modifying basic FMEA using K-means clustering

Proper determination of RPN relies on the correct assignment of S, D, and O values. The nature of these numbers implies that S and D are subjective, but O is objective. Particularly, the magnitude of O depends on the occurrence records of a failure mode. Suppose that O values lie within the range of 1–10, which suggests that occurrence probabilities are divided into ten distinctive classes. Consequently, if a type of failure occurs up to 2000 times a year, the range of each class will be 200. This is shown in Fig. 1.

Fig. 1 Occurrence classes in the mentioned example (with a minimum of 0 and a maximum of 2000 occurrences per year)

Now, assume one failure mode occurs 596 times and another occurs 604 times. In this case, the first failure mode falls in the third class, while the second falls in the fourth class, even though it occurred only eight times more than the first. This paradox casts doubt on the accuracy of occurrence number assignments.
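The boundary effect described above can be reproduced with a short Python sketch of equal-width classes. The function name `equal_width_class` and its defaults (a maximum of 2000 occurrences and ten classes, as in the example) are assumptions made for illustration.

```python
def equal_width_class(count, max_count=2000, n_classes=10):
    """Class index (1..n_classes) of `count` under equal-width binning."""
    width = max_count / n_classes  # 200 occurrences per class in the example
    # Counts on the upper boundary are clamped into the last class.
    return min(int(count // width) + 1, n_classes)


first = equal_width_class(596)   # falls in the third class
second = equal_width_class(604)  # falls in the fourth class
```

Two counts that differ by only eight occurrences receive different occurrence numbers simply because a class boundary at 600 lies between them.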

We use the intelligent, nonlinear clustering method of k-means to resolve this issue. In this algorithm, k cluster centers are randomly selected, where k is user-specified. In the next step, the Euclidean distance between each point and the cluster centers is measured, and each point is assigned to the cluster with the nearest center. When all points have been allocated, the centers are recalculated by averaging each cluster's members. This process continues until the predetermined ending condition is fulfilled [34]. The k-means clustering method is an unsupervised learning algorithm [35, 36]. To assign the occurrence number, the range is divided into ten classes. Then, the midpoint of each class, along with the other data points, is given to k-means as input. Since k-means does not leave any cluster empty, this process reduces the risk of placing two points with close numbers of occurrences in two separate clusters. Similarly, it is unlikely for two distant values to end up in two consecutive classes. Consequently, the paradox in the results is largely resolved [37].

2.4 Modifying risk priority numbers using fuzzy logic

High severity, regardless of RPN, means high risk [38], because even if the probability of occurrence is low or the possibility of preventive detection is high, the failure can still have adverse effects on the process customers. Therefore, the risky situations are the failure modes with a high RPN together with the high-severity cases. These two factors can be combined in different ways, but the choice depends entirely on the nature of the factors and on the way humans draw inferences. In such conditions, the tool closest to human inference is a fuzzy logic-based system [39].

The most common types of fuzzy systems are pure fuzzy, Takagi-Sugeno-based, and Mamdani-based systems [33, 40]. For human inference, which requires expert knowledge expressed in linguistic variables, their fuzzification, inference, and then defuzzification, the most appropriate option is a fuzzy system based on the Mamdani algorithm [15].

To achieve this goal, a fuzzy inference system has been formed, with two inputs and one output. The shape of the membership functions of the inputs and output, which are of type Trimf (Triangular-shaped membership function), is as shown in Fig. 2.

Fig. 2 Shape of the membership functions of (a) input variable “Severity”, (b) input variable “RRPN”, and (c) output variable “MRPN” for fuzzy system “Risk”

Also, the fuzzy rules and its inference system are as follows:

  1. If (Severity is Low) and (RRPN is Low) then (MRPN is Low).

  2. If (Severity is Low) and (RRPN is Moderate) then (MRPN is Low).

  3. If (Severity is Low) and (RRPN is High) then (MRPN is Moderate).

  4. If (Severity is Moderate) and (RRPN is Low) then (MRPN is Low).

  5. If (Severity is Moderate) and (RRPN is Moderate) then (MRPN is Moderate).

  6. If (Severity is Moderate) and (RRPN is High) then (MRPN is High).

  7. If (Severity is High) then (MRPN is High).

The result of fuzzy rules and the relationship of the inputs to the output is according to the surface drawn in Fig. 3.

Fig. 3 Surface plot for fuzzy system “Risk”

Therefore, if we call this fuzzy system "Risk", we can write:

$${\text{MRPN}}_{j} = {\text{Risk}}\left( {S_{j} , {\text{RRPN}}_{j} } \right)$$
(2)

where MRPN is the modified value of RPN, which treats high severity values as risky regardless of the raw RPN, and lies in the range of 0 to 100.
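The seven rules can be sketched as a small Mamdani-type system in Python. This is an illustrative sketch under several assumptions that the text does not specify: the triangular membership functions are taken to be evenly spaced over each range (Severity 1 to 10, RRPN 1 to 1000, MRPN 0 to 100) rather than read from Fig. 2, "and" is taken as the minimum, rule outputs are clipped and aggregated by the maximum, and defuzzification uses the centroid. The names `trimf`, `three_sets`, and `risk` are hypothetical.

```python
import numpy as np


def trimf(x, a, b, c):
    """Triangular membership with feet at a and c and peak at b (a < b < c)."""
    return np.interp(x, [a, b, c], [0.0, 1.0, 0.0])


def three_sets(lo, hi):
    """Assumed evenly spaced Low/Moderate/High triangles over [lo, hi]."""
    mid, half = (lo + hi) / 2.0, (hi - lo) / 2.0
    return {"Low": (lo - half, lo, mid),
            "Moderate": (lo, mid, hi),
            "High": (mid, hi, hi + half)}


SEV, RRPN, MRPN = three_sets(1, 10), three_sets(1, 1000), three_sets(0, 100)

# (Severity term, RRPN term or None, MRPN term), following rules 1-7.
RULES = [("Low", "Low", "Low"), ("Low", "Moderate", "Low"),
         ("Low", "High", "Moderate"), ("Moderate", "Low", "Low"),
         ("Moderate", "Moderate", "Moderate"), ("Moderate", "High", "High"),
         ("High", None, "High")]


def risk(severity, rrpn, n_points=1001):
    """Mamdani inference (min / max / centroid) returning MRPN in [0, 100]."""
    z = np.linspace(0.0, 100.0, n_points)    # output universe
    aggregated = np.zeros_like(z)
    for s_term, r_term, out_term in RULES:
        strength = trimf(severity, *SEV[s_term])
        if r_term is not None:               # "and" taken as minimum
            strength = min(strength, trimf(rrpn, *RRPN[r_term]))
        clipped = np.minimum(strength, trimf(z, *MRPN[out_term]))
        aggregated = np.maximum(aggregated, clipped)
    total = aggregated.sum()
    if total == 0.0:
        raise ValueError("no rule fired; inputs are outside the assumed ranges")
    return float((z * aggregated).sum() / total)  # centroid defuzzification
```

Under these assumptions, a failure mode with maximum severity and a minimal raw RPN, `risk(10, 1)`, still receives a high modified value because rule 7 fires alone, whereas `risk(1, 1)` stays low.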

2.5 Prioritizing actions using QFD

After determining the RPN value, possible preventive and corrective actions are determined for each failure mode. The quality function deployment (QFD) is used to determine the priority of each proposed solution. QFD is a customer-oriented method in designing new engineering products, aiming to maximize customer satisfaction [26, 27, 41]. The main idea of QFD is to provide a list of prioritized customer needs related to the product. Then, the technical characteristics of the product are specified. The QFD matrix, shown in (3), is the mapping of needs to technical characteristics of the product [25, 42, 43]:

(3)

where Wij shows how much the jth technical characteristic meets the ith need. Aj is a technical characteristic, and Ri is the priority number for ith need. Now, the weight of each technical feature is calculated by Eq. (4):

$$W_{j} = \mathop \sum \limits_{i = 1}^{m} R_{i} W_{ij}$$
(4)
$$W_{j}^{N} = \frac{{W_{j} }}{{\mathop \sum \nolimits_{j = 1}^{n} W_{j} }} = \frac{{\mathop \sum \nolimits_{i = 1}^{m} R_{i} W_{ij} }}{{\mathop \sum \nolimits_{j = 1}^{n} \mathop \sum \nolimits_{i = 1}^{m} R_{i} W_{ij} }}$$
(5)

Equation (5) shows the normalized weight.
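As a worked illustration with hypothetical numbers (not taken from the study), consider m = 2 needs with priorities R_1 = 3 and R_2 = 1, and n = 2 technical characteristics with W_11 = 9, W_12 = 1, W_21 = 3, and W_22 = 3. Summing over the needs for each characteristic gives:

$$W_{1} = 3 \times 9 + 1 \times 3 = 30,\quad W_{2} = 3 \times 1 + 1 \times 3 = 6$$

$$W_{1}^{N} = \frac{30}{30 + 6} \approx 0.833,\quad W_{2}^{N} = \frac{6}{30 + 6} \approx 0.167$$

In this example, the first characteristic would be ranked ahead of the second, and the normalized weights sum to one.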

2.6 The proposed algorithm

In an analogy with the design of an engineering product, the steps for performing the proposed algorithm will be as follows:

Step 1 Identifying the Customers:

Customers of an online exam process fall into two categories: "professors and assistants" as group A and "students" as group B. The reason for this classification follows from the analogy with engineering products, where process customers are classified as "Manufacturers and Service Providers" (internal customers) and "Consumers" (external customers). Here, professors and assistants act as manufacturers and service providers, and students as consumers. Their opinions about possible failure modes are considered the voice of the customer (VOC), or customer complaints. Suppose the number of people in group A is NA, and the number of people in group B is NB.

Step 2 Exploration of potential failure modes:

Using a survey of groups A and B, all possible failure modes are identified. Each failure mode is called Fj. Suppose the total number of failure modes is m. Therefore:

$$F = \left[ {F_{j} } \right],\quad j = 1\;{\text{to}}\;m$$
(6)

where F is the set of failure modes.

Step 3 Determine severity numbers (S):

For each Fj, determine the values \(\overline{S}_{A}^{j}\) and \(\overline{S}_{B}^{j}\), which are the average severity assigned to that failure mode by the individuals in groups A and B, respectively. Then calculate the value of Sj according to Eq. (7):

$$S_{j} = \frac{{\overline{S}_{A}^{j} + \overline{S}_{B}^{j} }}{2}$$
(7)
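A minimal Python sketch of Eq. (7) is given below, together with a pooled average for comparison. The function names and the ratings are hypothetical. The comparison illustrates that averaging the two group means gives each group equal weight regardless of its size.

```python
import numpy as np


def severity(ratings_a, ratings_b):
    """Eq. (7): mean of the group-A mean and the group-B mean."""
    return (np.mean(ratings_a) + np.mean(ratings_b)) / 2.0


def pooled_mean(ratings_a, ratings_b):
    """Mean over all respondents, shown only for comparison with Eq. (7)."""
    return np.mean(np.concatenate([ratings_a, ratings_b]))


# Hypothetical ratings: two respondents in group A, six in group B.
group_a = [8, 6]
group_b = [4, 4, 4, 4, 4, 4]

s_j = severity(group_a, group_b)        # (7.0 + 4.0) / 2
s_pooled = pooled_mean(group_a, group_b)  # 38 / 8
```

Because group A is smaller here, each of its members moves `s_j` more than a member of group B does, which is why `s_j` lies above the pooled mean in this example.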

Step 4 Determine detection numbers (D):

For each Fj, the Dj value is determined, which is the average of the detection number assigned to that failure mode by individuals in group A. (In this case, the poll is conducted only from group A).

Step 5 Identify the repetition of each failure mode:

\(q_{j}^{A}\) is the number of times the failure mode Fj is repeated in group A, and \(q_{j}^{B}\) is the same value in group B. The number of repetitions of failure mode Fj, qj, is calculated from Eq. (8):

$$q_{j} = q_{j}^{A} + q_{j}^{B}$$
(8)

Step 6 Calculate the central points of the occurrence intervals:

The maximum and minimum values of qj obtained in step 5 are called qmax and qmin, respectively. Therefore, the center of each occurrence interval can be calculated from (9):

$$q^{\prime }_{l} = q_{\min } + \left( {2l - 1} \right)\left( {\frac{{q_{\max } - q_{\min } }}{20}} \right)$$
(9)

where \(q^{\prime }_{l}\) is the center of the lth interval, and l is an integer from 1 to 10.
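As a worked instance of Eq. (9), take the values qmin = 9 and qmax = 61 reported in Sect. 3. The half-width of each interval is (61 − 9)/20 = 2.6, so the first and last centers are:

$$q^{\prime }_{1} = 9 + \left( {2 \times 1 - 1} \right)\left( {2.6} \right) = 11.6,\quad q^{\prime }_{10} = 9 + \left( {2 \times 10 - 1} \right)\left( {2.6} \right) = 58.4$$

These agree with the first and last entries of the set of interval centers listed in Sect. 3.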

Step 7 Calculating Occurrence values (O), using k-means:

Assume the set Q as Eq. (10):

$$Q = \left\{ {q_{j} , q^{\prime }_{l} } \right\}, \quad l = 1\;{\text{to}}\;10,\quad j = 1\;{\text{to}}\;m$$
(10)

Then, using MATLAB R2013 software, the results are obtained as Oj = k-means (Q, 10), where Oj is the cluster number of qj and gives its occurrence value.
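Steps 6 and 7 can be sketched in Python as follows. This is not the MATLAB routine used in the study but a plain one-dimensional k-means written for illustration. Two choices are assumptions rather than details given in the text: the cluster centers are initialized at the ten interval centers (instead of randomly) so that the sketch is reproducible, and clusters are ranked by their center so that the lowest cluster maps to O = 1 and the highest to O = 10. All function names and the sample counts are hypothetical.

```python
import numpy as np


def interval_centers(q_min, q_max, n_classes=10):
    """Eq. (9): centers of n_classes equal-width intervals on [q_min, q_max]."""
    l = np.arange(1, n_classes + 1)
    return q_min + (2 * l - 1) * (q_max - q_min) / (2 * n_classes)


def kmeans_1d(points, init_centers, max_iter=100):
    """Plain k-means iterations on scalar data; returns (labels, centers)."""
    points = np.asarray(points, dtype=float)
    centers = np.asarray(init_centers, dtype=float).copy()
    labels = np.zeros(len(points), dtype=int)
    for _ in range(max_iter):
        # Assignment step: nearest center (1-D Euclidean distance).
        labels = np.argmin(np.abs(points[:, None] - centers[None, :]), axis=1)
        new_centers = centers.copy()
        for k in range(len(centers)):
            members = points[labels == k]
            if members.size:  # a cluster that loses all points keeps its center
                new_centers[k] = members.mean()
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers


def occurrence_numbers(q, n_classes=10):
    """Map repetition counts q_j to occurrence numbers 1..n_classes."""
    q = np.asarray(q, dtype=float)
    q_prime = interval_centers(q.min(), q.max(), n_classes)
    Q = np.concatenate([q, q_prime])  # Eq. (10): counts plus interval centers
    labels, centers = kmeans_1d(Q, init_centers=q_prime)
    # Cluster indices are arbitrary, so rank clusters by center value.
    rank = np.empty(n_classes, dtype=int)
    rank[np.argsort(centers)] = np.arange(1, n_classes + 1)
    return rank[labels[: len(q)]]


# Hypothetical repetition counts, sorted for readability.
sample_q = [9, 12, 15, 20, 22, 30, 31, 45, 61]
sample_o = occurrence_numbers(sample_q)
```

With a different initialization (for example, the random one described in Sect. 2.3), the assignment of individual counts near a cluster boundary may differ from run to run.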

Step 8 Calculate raw RPN (RRPN) value for each failure mode:

Using Eq. (1), \({\text{RRPN}}_{j}\) values for each Fj are calculated (\({\text{RRPN}}_{j} = S_{j} \times D_{j} \times O_{j}\)).

Step 9 Determine Modified RPN (MRPN) value using fuzzy inference system:

Determine MRPN using Eq. (2) by applying fuzzy inference system “Risk”.

Step 10 Extract the possible solutions of each Fj:

This is done using a survey of people in both groups A and B. Similar and closely related suggestions are merged into a single solution. R denotes the total number of solutions (preventive and corrective actions), which we present as the set C:

$$C = \left[ {C_{r} } \right] , \quad r = 1\;{\text{to}}\;R$$
(11)

where Cr is the rth solution.

Step 11 Forming a QFD matrix:

In an analogy to the engineering products, failure modes of online academic exams are given as the customer needs. Here, priority number of each customer demand is MRPN of each failure mode (MRPNj), and suggested solutions will be the product technical characteristics (Cr). To fill the matrix, we acquire the average values from groups A and B. Therefore:

(12)

where Wjr is the effect of the solution Cr on the failure mode Fj. By analogy with Eq. (4), the weight of each solution (Wr) is given by (13):

$$W_{r} = \mathop \sum \limits_{j = 1}^{m} \left( {{\text{MRPN}}_{j} \times W_{jr} } \right)$$
(13)

The normalized weight of each solution (\(W_{r}^{N}\)) is:

$$W_{r}^{N} = \frac{{W_{r} }}{{\mathop \sum \nolimits_{r = 1}^{R} W_{r} }} = \frac{{\mathop \sum \nolimits_{j = 1}^{m} \left( {{\text{MRPN}}_{j} \times W_{jr} } \right)}}{{\mathop \sum \nolimits_{r = 1}^{R} \mathop \sum \nolimits_{j = 1}^{m} \left( {{\text{MRPN}}_{j} \times W_{jr} } \right)}}$$
(14)

Step 12 Prepare a list of preventive and corrective actions along with their priorities:

A prioritized list containing the set \(C = \left\{ {C_{r} } \right\}\), is presented as the result of the algorithm. \(W_{r}^{N}\) shows the solution's weight, which also indicates its priority.
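Steps 11 and 12 can be sketched in Python as below. The function name `prioritize`, the solution labels, the MRPN values, and the effect matrix are all hypothetical. The 0/1/3/9 effect scale is a common QFD convention that is assumed here, since the text does not state the scale used to fill the matrix.

```python
import numpy as np


def prioritize(mrpn, effects, names):
    """Rank solutions by the normalized weight of Eqs. (13) and (14).

    mrpn    : length-m sequence of MRPN_j (priority of each failure mode)
    effects : m x R matrix of W_jr (effect of solution r on failure mode j)
    names   : length-R sequence of solution labels
    """
    mrpn = np.asarray(mrpn, dtype=float)
    effects = np.asarray(effects, dtype=float)
    weights = mrpn @ effects                 # Eq. (13): sum over failure modes
    normalized = weights / weights.sum()     # Eq. (14)
    order = np.argsort(-normalized, kind="stable")  # highest weight first
    return [(names[r], float(normalized[r])) for r in order]


# Hypothetical data: three failure modes and three candidate solutions.
mrpn = [80.0, 40.0, 20.0]
effects = [[9, 0, 3],
           [0, 9, 1],
           [1, 3, 0]]
ranking = prioritize(mrpn, effects, ["C_1", "C_2", "C_3"])
```

In this illustration, the raw weights are 740, 420, and 280, so `C_1` is ranked first with a normalized weight of about 0.51.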

3 Results

Before implementing the proposed algorithm, as mentioned in Sect. 2.1, a survey was conducted to define a fair exam. At the same time, respondents were asked which aspects most strongly affect whether this definition is met, and the following 12 attributes were derived:

  • The questions are fully related to the topics

  • The duration of the exam is reasonable

  • Cheating is prevented

  • Appropriate references are taken for evaluation

  • Students have equal access to hardware and software facilities

  • Questions' demands are clear

  • If the questions vary for each student, the level of difficulty should be the same for all of them

  • Scores are distributed reasonably

  • The results are justifiable

  • A clear statement of evaluation policies and exam details is given before the test

  • The level of questions is proportional to the level of teaching

  • Appropriate time and location are considered for the test.

It is worth noting that another customer of the process is the "educational system". To avoid prolonging the content and scattering the results, its needs are treated as embedded within the needs of the two mentioned groups. For example, the relevance of the exam content to the taught topics and the use of appropriate references ensure that the training is in line with the objectives of the education system. Prevention of widespread cheating in the exam guarantees the validity of the training provided by the educational system, and clarifying the demands of the exam questions follows the goals of the education system.

Then, the proposed algorithm was implemented in two consecutive semesters (spring 2020 and fall 2021). Eighty people, including 60 students, 8 professors, and 12 teaching assistants (20 people in group A and 60 people in group B), participated. Following steps 1 and 2, a total of 33 potential failure modes (set F) were identified; they are given in Table 2 (column 3).

Table 2 Potential failure modes and their causes

For clarity, these failure modes are classified under the 12 attributes (column 2). The causes of each one (obtained through surveys) are given in the fourth column of this table.

The severity number ranges between 1 and 10. Severity numbers above 7, marked by a dashed line in Fig. 4, are highly critical and must be treated regardless of their overall RPN. According to step 3 of Sect. 2.6 and based on Eq. (7), to obtain the severity of each failure mode, the averages are calculated separately in groups A and B and listed in columns SjA and SjB, respectively, in Table 3. The average of these two values is then calculated and placed in the third column (Sj). Because the two group averages are weighted equally while the groups differ in size (20 people in group A and 60 people in group B), each person's opinion in group A carries more influence than that of a person in group B.

Fig. 4 Comparative chart of "Severity" values

Table 3 Severity number of failure modes

Then, for the detection number, the average of the detection numbers assigned to each failure mode by individuals in group A is calculated and reported in Table 4. The detection number ranges between 1 and 10. It is divided into three parts: the range 1–3 as easy and obvious detection, the range 3–7 as average and normal detection, and the range 7–10 as difficult detection. These sections are separated by two dashed lines in Fig. 5.

Fig. 5 Comparative chart of "Detection" values

Table 4 Detection numbers of failure modes

According to steps 5 and 6 of Sect. 2.6, the number of repetitions of each failure mode is calculated and presented in Table 5, giving qmax = 61 and qmin = 9. Next, using Eq. (9), the center of each occurrence interval is calculated as follows:

$$q^{{\prime }} = \left\{ {11.6, 16.8, 22.0, 27.2, 32.4, 37.6, 42.8, 48.0, 53.2, 58.4} \right\}$$
Table 5 Number of repetitions of each failure mode

Next, as mentioned in Sect. 2.6, step 7, we form the set \(Q = \left\{ {q_{j} , q^{{\prime }}_{l} } \right\}\). The occurrence numbers are obtained for each failure mode by clustering Q into 10 categories using k-means (Q,10) in MATLAB R2013 software. The occurrence numbers (O) are arranged in Table 6. The cluster centers obtained from the k-means process are listed in Table 7.

Table 6 Occurrence numbers for each failure mode
Table 7 Occurrence cluster centers obtained from k-means process

Then, using Eq. (1), \(\left( {{\text{RRPN}}_{j} = S_{j} \times D_{j} \times O_{j} } \right)\), RRPNj values are calculated as shown in Table 8. For modifying the value of RRPN, the MRPN is determined using Eq. (2) by applying the fuzzy inference system “Risk”. This can be seen in Table 9.

Table 8 Values of raw risk priority number (RRPN) for failure modes
Table 9 Values of modified risk priority number (MRPN) for failure modes

In the next step, a total of 41 possible solutions for the failure modes were extracted using a survey of people in both groups A and B. As mentioned in Sect. 2.6, step 10, similar and closely related suggestions are merged and arranged in Table 10 as C1 to C41.

Table 10 Preventive and corrective actions

After listing the solutions, according to step 11 of Sect. 2.6, in an analogy to the engineering products, the QFD matrix was generated. Failure modes are given as the customer needs, and MRPNs are their priority. Suggested solutions are assumed as technical product characteristics (Cr). Then, using average values from groups A and B, the QFD matrix was completed. The weight (here, priority) and normalized weight of each solution were obtained by applying (13) and (14), respectively. The result is presented in Table 11. As mentioned in step 12, this is the final result of the proposed algorithm. Prioritized actions are listed in Table 12.

Table 11 Normalized weight of solutions
Table 12 Normalized change in values of RPN

4 Discussion

According to Table 1, if the severity numbers are in the range of 7–10, they express the major effect of the failure mode on the end-user. Therefore, the number 7 is marked in the diagram with a dividing line as the "threshold". The highest severity numbers in the critical region (F11, F15, F10, F2, and F14) show that the most confusing and dissatisfying effect in an online test is related to credibility and fraud prevention.

Also, if the scoring is not entirely consistent with a specific policy, it can cause severe adverse effects. On the other hand, designing test questions by someone other than the instructor can cause serious problems.

According to Tables 1 and 4 and Fig. 5, only one failure mode is within the difficult detection range, which is F14 (inconsistency in grading). It is quite logical that if the question designer (who should be the instructor himself/herself) does not provide a specific key to grading the exam answer scripts, it will not be easy to identify the consistency of the results.

Considering the numbers in Table 6, which are derived from the proposed k-means system for determining occurrence numbers, and Fig. 6, which is a comparative graph of occurrence values, the failure modes most likely to occur (F13, F5, F19, F22, F27, and F28) do not have high severity. Therefore, it can be concluded that the online exams held so far are mainly at an acceptable level of customer satisfaction, and efforts should focus on improving the current level.

Fig. 6 Comparative chart of "Occurrence" values

Also, the presence of these failure modes among the most probable ones shows that the main reasons for the occurrence of failure modes are the way the instructor teaches, how precisely expectations are expressed, and how well the allotted time matches the questions.

According to Table 9 and Fig. 7, the failure modes with the highest MRPN (modified risk priority number) are F11, F10, F2, F15, and F9. They show that the main critical issue in an online exam is cheating, which can undermine the validity and fairness of the exam. Also, the presence of heterogeneity or an incorrect key can disrupt the whole result. On the other hand, if the questions are not from the taught topics, the test is invalid. In general, it can be said that if cheating is prevented, we can hopefully accept the appropriateness of the online exam.

Fig. 7 Comparative chart of "MRPN" values

To evaluate how well the fuzzy inference system provides more appropriate criteria for comparing the criticality of failure modes, we should study the cases whose priority changed most relative to the initial RPN. To do this, both RRPN and MRPN values should be normalized. The normalization range here is 1–100, depending on the numbers available. From Table 12 and Fig. 8, the largest increases in priority occurred in F15, F14, and F17. These failure modes do not have very large RRPNs, but their severity values are high. Therefore, the fuzzy inference system has raised their priority. This indicates the correct operation of the modifier system.

Fig. 8 Comparative chart of "RPN" change (Influence of RPN Modification)

Time and cost constraints prevent us from implementing all corrective and preventive actions (C1 to C41 in Table 10). Therefore, we need to prioritize them. High-priority solutions are the actions that can prevent the more hazardous failure modes. Table 11 shows the most important actions for maximizing customer (faculty, assistants, and students) satisfaction during an online test (C33, C19, C36, C16, and C9): clearly expressing the expectations of the test and the evaluation methods, holding the exam with sufficient supervision at the right time and place, having the instructor design the exam questions and key, and providing appropriate infrastructure. These actions can prevent potential problems in an online test.

It is also emphasized that, since this method is based on FMEA, the provided solutions have a preventive aspect, leading to a reduction in adverse effects in emergencies such as the recent COVID-19 pandemic.

5 Limitations and future scope of the work

Because in this study, all surveys are based on crisp numbers, there may be some deviation in the conclusions. Subsequent studies using fuzzy logic (which is closer to the human mental structure in terms of ambiguity and psycholinguistics) could yield better results.

In the implementation of the second part of the algorithm, the surveys were conducted only for the students of Sharif University of Technology (which is an engineering university). Further studies at various universities, including all four departments of Mathematical and Technical Sciences, Medical Sciences, Humanities and Arts, will have a significant impact on the comprehensiveness of the results.

Using different methods of data mining and data processing, such as AHP, ANP, and DEMATEL, can be very helpful in better analyzing the results.

6 Conclusion

The COVID-19 pandemic and the need to adhere to health protocols, including avoiding crowded gatherings, have led to a sudden and growing demand for online college classes. The assessment process is one of the most important components of any academic course, especially when a crisis exists. Because of time and cost constraints, implementing all proposed solutions is impossible, which makes it necessary to prioritize them.

In this study, a fair online exam is defined as a test that leads to customer satisfaction (including faculty, assistants, educational system, and students). Then, in analogy to an engineering product, the product design process is performed on it. As the first stage, the FMEA process, which is a preventive method for identifying potential failure modes, is employed to find the potential failure modes, their severity, occurrence, and preventive detection method. Then, the risk priority number of each case is calculated. The K-means method, which is an unsupervised clustering algorithm, has been used to eliminate or minimize the effects of conflicting conditions in assigning occurrence-related numbers, and a fuzzy inference system has been used to modify the risk priority number so that high-severity failure modes are treated as risky. The results show the effectiveness of these two modifications on determining the risk priority of failure modes. Then, the QFD algorithm was used to determine the weight of each solution and prioritize its application by considering the proposed solutions as technical characteristics of an engineering product.

The results show that maximum stakeholder satisfaction is obtained when the taught topics and exam titles are consistent, the instructor's expectations of the students are clear, there is a clear assessment policy, the test is held under adequate supervision at the right time and place and with appropriate infrastructure, and the test questions are designed by the instructor. According to the provided definition, this will increase the validity of the online test.